Structure2Vec + DQN [2017] Learning Combinatorial Optimization Algorithms over Graphs

https://proceedings.neurips.cc/paper_files/paper/2017/file/d9896106ca98d3d05b8cbdf4fd8b13a1-Paper.pdf

This paper was published at NIPS 2017. One of the important researchers on it is Bistra Dilkina, from USC.

Furthermore, existing works typically use the policy gradient for training [6], a method that is not particularly sample-efficient.

They use a single meta-learning algorithm that efficiently learns effective heuristics for each of the problems. They solve three problems: Minimum Vertex Cover (MVC), Maximum Cut (MAXCUT), and the Traveling Salesman Problem (TSP). Each problem is solved with a greedy algorithm that constructs a solution by sequentially adding nodes to a partial solution S, based on maximizing some evaluation function Q that measures the quality of a node in the context of the current partial solution. A maintenance (or helper) procedure h(S) is needed, which maps an ordered list S to a combinatorial structure satisfying the specific constraints of the problem. The quality of a partial solution S is given by an objective function c(h(S), G) based on the combinatorial structure h(S).

A generic greedy algorithm selects the node v to add next such that v maximizes an evaluation function Q(h(S), v) \in \mathbb{R}, which depends on the combinatorial structure h(S) of the current partial solution. The partial solution S is then extended as

S := (S, v^*), \quad v^* := \argmax_{v \in \bar{S}} Q(h(S), v)
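
To make this greedy meta-algorithm concrete, here is a minimal Python sketch under the assumption of a networkx-style graph; the names `eval_q`, `helper`, and `terminated` are illustrative placeholders (in the paper, the evaluation function is the learned \hat{Q}).

```python
def greedy_construct(graph, eval_q, helper, terminated):
    """Generic greedy meta-algorithm: repeatedly add the node that maximizes
    the evaluation function Q(h(S), v) until the termination criterion fires.

    graph      -- the problem instance G (networkx-style, assumed)
    eval_q     -- placeholder for Q(h(S), v), e.g. the learned Q-hat
    helper     -- h(S): maps the ordered node list S to a feasible structure
    terminated -- problem-specific termination criterion
    """
    partial = []                          # ordered partial solution S
    remaining = set(graph.nodes)          # S-bar: candidate nodes not yet added
    while remaining and not terminated(helper(partial), graph):
        # v* := argmax_{v in S-bar} Q(h(S), v)
        v_star = max(remaining, key=lambda v: eval_q(helper(partial), v))
        partial.append(v_star)
        remaining.remove(v_star)
    return helper(partial)
```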

MVC: The helper function does not need to do any work, and c(h(S), G) = -|S|. The termination criterion checks whether all edges have been covered.

MAXCUT: The helper function divides V into two sets, S and its complement \bar{S} = V \setminus S, and maintains a cut-set C = \{(u, v) \mid (u, v) \in E, u \in S, v \in \bar{S}\}. The cost is c(h(S), G) = \sum_{(u, v) \in C} w(u, v), and the termination criterion does nothing.

TSP: The helper function maintains a partial tour over the selected nodes, the cost is the negative tour length c(h(S), G) = -\sum_{i=1}^{|S|-1} w(S(i), S(i+1)) - w(S(|S|), S(1)), and the termination criterion is activated when S = V. Empirically, inserting a node u in the partial tour at the position which increases the tour length the least is a better choice, so this insertion procedure is adopted as the helper function for TSP.
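
As a concrete illustration of these three problem definitions, here is a hedged sketch of the cost, termination, and TSP-insertion helpers, assuming a networkx-style weighted graph (edge weights stored under the key "weight"); the function names are mine, not the paper's.

```python
def mvc_cost(S, G):
    # MVC: c(h(S), G) = -|S|; the helper function does nothing
    return -len(S)

def mvc_terminated(S, G):
    # Terminate once every edge has at least one endpoint in S
    S = set(S)
    return all(u in S or v in S for u, v in G.edges)

def maxcut_cost(S, G):
    # MAXCUT: total weight of edges crossing the cut (S, V \ S)
    S = set(S)
    return sum(d.get("weight", 1.0)
               for u, v, d in G.edges(data=True) if (u in S) != (v in S))

def tsp_cost(tour, G):
    # TSP: negative length of the closed tour over the selected nodes
    if len(tour) < 2:
        return 0.0
    return -sum(G[u][v]["weight"] for u, v in zip(tour, tour[1:] + tour[:1]))

def tsp_insert(tour, v, G):
    # TSP helper h(S): insert v at the position that increases the tour
    # length the least (i.e. keeps tsp_cost as high as possible)
    if len(tour) < 2:
        return tour + [v]
    best = max(range(len(tour) + 1),
               key=lambda i: tsp_cost(tour[:i] + [v] + tour[i:], G))
    return tour[:best] + [v] + tour[best:]
```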

structure2vec

The graph embedding network is based on the structure2vec idea from this paper:

https://arxiv.org/pdf/1603.05629.pdf

This graph embedding network computes a p-dimensional feature embedding \mu_v for each node v \in V, given the current partial solution S. One variant of the structure2vec architecture initializes the embedding \mu_v^{(0)} at each node as 0 and, for all v \in V, updates the embeddings synchronously at each iteration as

\mu_v^{(t+1)} \leftarrow F\big(x_v, \{\mu_u^{(t)}\}_{u \in N(v)}, \{w(v, u)\}_{u \in N(v)}; \Theta\big)

where x_v is a node-specific tag (e.g. whether v is in the current partial solution), N(v) is the set of neighbors of node v in graph G, and F is a generic nonlinear mapping such as a neural network or kernel function.
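
A hedged sketch of this synchronous update schedule, with F left as an arbitrary callable (the concrete parameterization used by the paper is described in the next section):

```python
import numpy as np

def structure2vec(F, x, neighbors, weights, p, T=4):
    """Generic structure2vec: initialize mu_v^{(0)} = 0, then apply F
    synchronously to every node for T iterations.

    F         -- callable F(x_v, neighbor_embeddings, neighbor_weights) -> (p,)
    x         -- dict: node -> tag x_v (e.g. 1 if v is in the partial solution)
    neighbors -- dict: node -> list of neighbors N(v)
    weights   -- dict: (v, u) -> edge weight w(v, u)
    """
    mu = {v: np.zeros(p) for v in x}      # mu_v^{(0)} = 0
    for _ in range(T):
        # every node reads the previous iteration's embeddings (synchronous)
        mu = {v: F(x[v],
                   [mu[u] for u in neighbors[v]],
                   [weights[v, u] for u in neighbors[v]])
              for v in x}
    return mu
```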

Parameterizing the Q function

They design F to update a p-dimensional embedding \mu_v as

\mu_v^{(t+1)} \leftarrow relu\big(\theta_1 x_v + \theta_2 \sum_{u \in N(v)} \mu_u^{(t)} + \theta_3 \sum_{u \in N(v)} relu(\theta_4\, w(v, u))\big)

where \theta_1 \in \mathbb{R}^p, \theta_2, \theta_3 \in \mathbb{R}^{p \times p}, and \theta_4 \in \mathbb{R}^p are the model parameters.

The summation over neighbors is one way of aggregating neighborhood information that is invariant to permutations over neighbors.
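
Here is a minimal numpy sketch of this parameterized update, assuming a scalar node tag x_v \in \{0, 1\} and dense weight/adjacency matrices; the \theta parameters are randomly initialized stand-ins for learned weights, so this is an illustration rather than the authors' implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def s2v_update(x, W, adj, p=64, T=4, rng=np.random.default_rng(0)):
    """Parameterized structure2vec updates (sketch of the relu-based F).

    x   -- (n,)  node tags, e.g. 1 if the node is in the partial solution
    W   -- (n,n) edge weights
    adj -- (n,n) 0/1 adjacency matrix
    Returns mu: (n, p) node embeddings after T synchronous iterations.
    """
    n = len(x)
    theta1 = rng.normal(scale=0.1, size=(p,))    # for the node tag x_v
    theta2 = rng.normal(scale=0.1, size=(p, p))  # for summed neighbor embeddings
    theta3 = rng.normal(scale=0.1, size=(p, p))  # for aggregated edge terms
    theta4 = rng.normal(scale=0.1, size=(p,))    # for each edge weight w(v, u)

    mu = np.zeros((n, p))                        # mu_v^{(0)} = 0
    # sum_{u in N(v)} relu(theta4 * w(v, u)) does not change across iterations
    edge_term = relu(W[:, :, None] * theta4[None, None, :])   # (n, n, p)
    edge_agg = (adj[:, :, None] * edge_term).sum(axis=1)      # (n, p)
    for _ in range(T):
        neigh = adj @ mu                         # sum of neighbor embeddings
        mu = relu(np.outer(x, theta1) + neigh @ theta2.T + edge_agg @ theta3.T)
    return mu
```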

Once the embedding for each node is computed after T iterations, these embeddings are used to define the evaluation function

\hat{Q}(h(S), v; \Theta) = \theta_5^T relu\big(\big[\theta_6 \sum_{u \in V} \mu_u^{(T)},\ \theta_7 \mu_v^{(T)}\big]\big)

where [\cdot, \cdot] is the concatenation operator, \theta_5 \in \mathbb{R}^{2p}, and \theta_6, \theta_7 \in \mathbb{R}^{p \times p}.
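
Continuing the numpy sketch, the readout below pools the node embeddings into a graph embedding and combines it with the candidate node's embedding; the \theta arguments are again illustrative stand-ins for learned parameters.

```python
import numpy as np

def q_hat(mu, v, theta5, theta6, theta7):
    """Q-hat(h(S), v; Theta): score candidate node v from the node embeddings.

    mu             -- (n, p) embeddings mu^{(T)} after T iterations
    theta5         -- (2p,)  readout weights
    theta6, theta7 -- (p, p) projections of the pooled-graph / candidate embedding
    """
    pooled = theta6 @ mu.sum(axis=0)                 # theta6 * sum_u mu_u^{(T)}
    node = theta7 @ mu[v]                            # theta7 * mu_v^{(T)}
    concat = np.concatenate([pooled, node])          # [. , .] concatenation
    return float(theta5 @ np.maximum(concat, 0.0))   # theta5^T relu(...)
```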

Training RL

Actions: an action v is a node of G that is not part of the current partial solution S. Actions are represented by their corresponding p-dimensional node embedding \mu_v, a definition that is applicable across graphs of various sizes.

Rewards: the reward function r(S, v) at state S is defined as the change in the cost function after taking action v and transitioning to the new state S' := (S, v). That is,

r(S, v) = c(h(S'), G) - c(h(S), G)

Policy: based on \hat{Q}, a deterministic greedy policy \pi(v|S) := \argmax_{v' \in \bar{S}} \hat{Q}(h(S), v') is used.
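
To tie these definitions together, below is a hedged sketch of forming 1-step Q-learning targets (DQN style); note the paper actually trains with n-step returns, fitted Q-iteration, and experience replay, and the callable `q_fn` as well as the transition format are assumptions for illustration.

```python
import numpy as np

def one_step_q_targets(transitions, q_fn, gamma=1.0):
    """Form 1-step targets y = r + gamma * max_{v'} Q-hat(h(S'), v').

    transitions -- list of (state, action, reward, next_state, next_candidates)
    q_fn        -- callable q_fn(state, node) -> float, the current Q-hat
    Returns (targets, predictions); the parameters Theta are then updated by
    minimizing the squared error (y - Q-hat(h(S), v))^2 with any optimizer.
    """
    targets, preds = [], []
    for state, v, r, next_state, candidates in transitions:
        if candidates:                      # non-terminal: bootstrap from Q-hat
            y = r + gamma * max(q_fn(next_state, u) for u in candidates)
        else:                               # terminal state: no bootstrapping
            y = r
        targets.append(y)
        preds.append(q_fn(state, v))
    return np.array(targets), np.array(preds)
```

During data collection, actions are chosen \epsilon-greedily with respect to \hat{Q}, and each reward is the change in cost c(h(S'), G) - c(h(S), G) as defined above.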

Experiment